
Reviews: Full-Gradient Representation for Neural Network Visualization

Neural Information Processing Systems

Updates based on author feedback: Given that the authors added the digit-flipping experiments and obtained good results (albeit with different choices of post-processing), I am increasing my score to a 7. However, ***the increased score is based on good faith that the authors will add the following to their paper***: (1) I think it would be extremely helpful to practitioners if the authors exposed how different choices of post-processing affect the results. What happens to the digit-flipping experiments when sign information is discarded? What happens to Remove & Retrain when sign information is retained? Please be as up front as possible about the caveats; practitioners should be made aware that the choice of post-processing is something they need to pay close attention to, or the method may be used in a way that gives misleading results (as mentioned, we have seen this happen before with Guided Backprop).
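
To make the reviewer's concern concrete, here is a minimal sketch (not the authors' code; the array shapes and names are illustrative) of how the two post-processing choices in question, retaining versus discarding gradient sign, can produce saliency maps that rank pixels quite differently:

```python
# Minimal sketch: two common post-processing choices for a gradient-based
# saliency map. Keeping the sign preserves direction-of-effect; taking the
# absolute value discards it. This is exactly the kind of choice the review
# asks to be made explicit. The gradient tensor here is a random stand-in.
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=(3, 32, 32))          # stand-in for an input gradient

signed_map = grad.sum(axis=0)                # sign retained, channels summed
unsigned_map = np.abs(grad).sum(axis=0)      # sign discarded before summing

# The two maps can order pixels very differently:
print(np.corrcoef(signed_map.ravel(), unsigned_map.ravel())[0, 1])
```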


Manipulating Sparse Double Descent

Zhang, Ya Shi

arXiv.org Artificial Intelligence

This paper investigates the double descent phenomenon in two-layer neural networks, focusing on the role of L1 regularization and representation dimensions. It explores an alternative double descent phenomenon, named 'sparse double descent'. The study emphasizes the complex relationship between model complexity, sparsity, and generalization, and suggests further research into more diverse models and datasets. The findings contribute to a deeper understanding of neural network training and optimization.
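
A hedged sketch of the setup the abstract describes: a two-layer network whose sparsity is controlled by an L1 penalty, the knob one would sweep to trace out a sparse double descent curve. The layer sizes and penalty strength `l1_lambda` are illustrative, not taken from the paper.

```python
# Two-layer network trained with an L1 penalty on the weights; sweeping
# l1_lambda (and hence sparsity) is the axis along which sparse double
# descent would be measured. Data here is random for self-containment.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 512), nn.ReLU(), nn.Linear(512, 10))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
l1_lambda = 1e-4

x, y = torch.randn(64, 100), torch.randint(0, 10, (64,))
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss = loss + l1_lambda * sum(p.abs().sum() for p in model.parameters())
    loss.backward()
    opt.step()
```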


Efficient Adversarial Training with Robust Early-Bird Tickets

Xi, Zhiheng, Zheng, Rui, Gui, Tao, Zhang, Qi, Huang, Xuanjing

arXiv.org Artificial Intelligence

Adversarial training is one of the most powerful methods to improve the robustness of pre-trained language models (PLMs). However, this approach is typically more expensive than traditional fine-tuning because of the necessity to generate adversarial examples via gradient descent. Delving into the optimization process of adversarial training, we find that robust connectivity patterns emerge in the early training phase (typically $0.15\sim0.3$ epochs), far before parameters converge. Inspired by this finding, we extract robust early-bird tickets (i.e., subnetworks) to develop an efficient adversarial training method: (1) searching for robust tickets with structured sparsity in the early stage; (2) fine-tuning robust tickets in the remaining time. To extract the robust tickets as early as possible, we design a ticket convergence metric to automatically terminate the searching process. Experiments show that the proposed efficient adversarial training method can achieve up to $7\times\sim13\times$ training speedups while maintaining comparable or even better robustness compared to the most competitive state-of-the-art adversarial training methods.
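
One plausible shape for the ticket convergence check the abstract mentions (the actual metric is defined in the paper; `keep_ratio` and `eps` here are illustrative) is to compare the pruning masks drawn at consecutive search epochs and stop searching once they stop changing:

```python
# Hedged sketch: stop the ticket-search phase once consecutive pruning
# masks agree, measured by normalized Hamming distance. This is in the
# spirit of early-bird ticket detection, not the paper's exact metric.
import numpy as np

def mask_from_scores(scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Keep the top `keep_ratio` fraction of weights by importance score."""
    k = max(1, int(scores.size * keep_ratio))
    thresh = np.partition(scores.ravel(), -k)[-k]
    return scores >= thresh

def masks_converged(prev: np.ndarray, curr: np.ndarray, eps: float = 0.01) -> bool:
    """True once the fraction of entries that changed drops below eps."""
    return np.mean(prev != curr) < eps
```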


EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets

Chen, Xiaohan, Cheng, Yu, Wang, Shuohang, Gan, Zhe, Wang, Zhangyang, Liu, Jingjing

arXiv.org Artificial Intelligence

Deep, heavily overparameterized language models such as BERT, XLNet and T5 have achieved impressive success in many natural language processing (NLP) tasks. However, their high model complexity requires enormous computation resources and extremely long training time for both pre-training and fine-tuning. Many works have studied model compression on large NLP models, but they focus only on reducing inference time while still requiring an expensive training process. Other works use extremely large batch sizes to shorten the pre-training time, at the expense of higher computational resource demands. In this paper, inspired by the Early-Bird Lottery Tickets recently studied for computer vision tasks, we propose EarlyBERT, a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models. By slimming the self-attention and fully-connected sub-layers inside a transformer, we are the first to identify structured winning tickets in the early stage of BERT training. We apply those tickets towards efficient BERT training, and conduct comprehensive pre-training and fine-tuning experiments on GLUE and SQuAD downstream tasks. Our results show that EarlyBERT achieves comparable performance to standard BERT, with 35~45% less training time.
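
A hedged sketch of the "slimming" idea the abstract describes: attach learnable scalar coefficients to each self-attention head, regularize them with an L1 penalty during the short search phase, then keep only the heads whose coefficients survive. Shapes, names, and the number of heads kept are illustrative, not from the paper.

```python
# One gate per attention head; an L1 penalty on the gates drives some of
# them toward zero during the search phase, yielding a structured ticket.
import torch
import torch.nn as nn

num_heads, head_dim = 12, 64
head_coeff = nn.Parameter(torch.ones(num_heads))     # learnable head gates

def gated_heads(head_outputs: torch.Tensor) -> torch.Tensor:
    # head_outputs: (batch, num_heads, seq_len, head_dim)
    return head_outputs * head_coeff.view(1, -1, 1, 1)

def slimming_penalty(l1_lambda: float = 1e-4) -> torch.Tensor:
    return l1_lambda * head_coeff.abs().sum()        # added to the task loss

out = gated_heads(torch.randn(2, num_heads, 16, head_dim))
# After the search phase, the structured ticket keeps the top-k heads:
keep = torch.topk(head_coeff.abs(), k=8).indices
```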


Logarithmic Pruning is All You Need

Orseau, Laurent, Hutter, Marcus, Rivasplata, Omar

arXiv.org Machine Learning

The Lottery Ticket Hypothesis is a conjecture that every large neural network contains a subnetwork that, when trained in isolation, achieves comparable performance to the large network. An even stronger conjecture has been proven recently: every sufficiently overparameterized network contains a subnetwork that, at random initialization, but without training, achieves comparable accuracy to the trained large network. This latter result, however, relies on a number of strong assumptions and guarantees only a polynomial factor on the size of the large network compared to the target function. In this work, we remove the most limiting assumptions of this previous work while providing significantly tighter bounds: the overparameterized network only needs a number of neurons per weight of the target subnetwork that is logarithmic in all variables but depth.
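
To make the two statements in the abstract concrete, here is one schematic formalization; the notation is mine, not the paper's. $F$ is the target network, $G$ the large random network, and $G_S$ the subnetwork induced by keeping only the weights in a mask $S$:

```latex
% Schematic reading of the abstract (notation is illustrative, not the
% paper's). Lottery ticket: some subnetwork trains to comparable accuracy.
\[
  \exists\, S \subseteq \mathrm{weights}(G):\quad
  \mathrm{acc}\big(\mathrm{train}(G_S)\big) \approx \mathrm{acc}\big(\mathrm{train}(G)\big)
\]
% Stronger form: with G at random initialization and no training, some
% subnetwork already approximates the target network F to accuracy eps.
\[
  \exists\, S:\quad
  \sup_{x}\, \big\| G_S(x) - F(x) \big\| \le \epsilon
\]
```

On this reading, the contribution is that the required width of $G$ scales only logarithmically (in all variables but depth) per weight of $F$, rather than polynomially as in the earlier result.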


Just One View: Invariances in Inferotemporal Cell Tuning

Riesenhuber, Maximilian, Poggio, Tomaso

Neural Information Processing Systems

In macaque inferotemporal cortex (IT), neurons have been found to respond selectively to complex shapes while showing broad tuning ("invariance") with respect to stimulus transformations such as translation and scale changes and a limited tuning to rotation in depth.

